A food retail chain wants to improve the quality of its wines in order to offer its customers a better shopping experience. To this end, a machine learning model is to be developed that determines the quality of white wines from physically measurable attributes.
The goal of the project is to build a model that can distinguish good white wines from the other white wines with a score of at least 80%.
For training the ML models, a dataset of wines containing 12 attributes is used:
1. Fixed Acidity: Most acids in wine are fixed, i.e. non-volatile acids (they do not evaporate readily). Acidity is a property determined by the total sum of acids in a sample. We can quantify all acids together (total acidity) or in groups (fixed acidity and volatile acidity). Fixed acidity corresponds to the low-volatility organic acids such as malic, lactic, tartaric or citric acid and depends on the properties of the sample.
2. Volatile Acidity: The amount of acetic acid in wine, which at too high a concentration can lead to an unpleasant, vinegar-like taste. Volatile acidity corresponds to the amount of short-chain organic acids that can be recovered from the sample by distillation: formic, acetic, propionic and butyric acid.
3. Citric Acid: Citric acid occurs in small quantities and can add "freshness" and flavor to wines. It is a colorless, weak organic acid that occurs naturally in citrus fruits. In biochemistry it is an intermediate of the citric acid cycle, which runs in the metabolism of all aerobic organisms.
4. Residual Sugar: The amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter. Residual sugar refers to the sugars left unfermented in a finished wine and is measured in grams of sugar per liter (g/l). The amount of residual sugar affects a wine's sweetness, and in the EU the RS level is tied to specific labeling terms.
5. Chlorides: The amount of salt in the wine. The higher extraction of chloride in red-wine production is due to ions extracted from the skins during fermentation. The red juice should therefore contain no more than 356 mg/l of chloride ions so that the finished wine does not exceed the maximum permitted value of 606 mg/l of chloride (356 mg/l in the red juice × 1.7 = 606).
6. Free Sulfur Dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and the bisulfite ion. Free sulfites are those that can still react and therefore have both antimicrobial and antioxidant properties. Bound sulfites are those that have already reacted (reversibly or irreversibly) with other molecules in the wine. The sum of free and bound sulfites defines the total sulfite concentration.
7. Total Sulfur Dioxide: The amount of free and bound forms of SO2. At low concentrations SO2 is mostly undetectable in wine, but at higher free-SO2 concentrations it becomes evident in the nose and taste. Simply put, total sulfur dioxide (TSO2) is the portion of SO2 that is free in the wine plus the portion bound to other chemicals in the wine such as aldehydes, pigments or sugars.
8. Density: The density of wine is close to that of water, depending on its alcohol and sugar content. A hydrometer is an instrument for measuring liquid density: a sealed glass tube with a weighted bulb at one end. Winemakers use it to measure the density of juice, fermenting wine and finished wine relative to pure water; this ratio is called specific gravity (SG).
9. pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines lie between 3 and 4. Wines with higher pH values (>3.65) face a number of potential challenges during vinification and aging. First, high-pH wines have an increased risk of microbial spoilage; traditionally, sulfur dioxide (often in the form of potassium metabisulfite) is used against this, and high-pH wines require larger additions for the same protection.
10. Sulphates (sulfates): A wine additive that can contribute to the sulfur dioxide (SO2) level, which acts as an antimicrobial agent. Sulfites occur naturally in all wines in small amounts and are one of the thousands of chemical by-products formed during fermentation; winemakers also add them to protect the wine against bacteria and yeast spoilage. For some people, sulfur sensitivities may be linked to headaches and congested sinuses after a glass or two of wine.
11. Alcohol: The alcohol content of the wine in percent.
12. Quality: Output variable (based on sensory data, rated from 3 to 9).
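The stated project goal is binary: separate good white wines from the rest. A minimal sketch of such a labeling step (the cut-off of 7 used here is a hypothetical assumption, not fixed by the project description):

```python
import pandas as pd

# Hypothetical assumption: sensory scores of 7-9 count as "good" (label 1)
GOOD_THRESHOLD = 7

def quality_to_good(quality: int) -> int:
    """Map a sensory quality score (3-9) to a binary 'good wine' label."""
    return int(quality >= GOOD_THRESHOLD)

scores = pd.Series([3, 5, 6, 7, 8, 9], name="quality")
print(scores.apply(quality_to_good).tolist())  # [0, 0, 0, 1, 1, 1]
```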
Model building and analysis
Results and interpretation
https://archive.ics.uci.edu/ml/datasets/wine+quality
Citation:
# Import required libraries
import numpy as np
import pandas as pd
import pickle
from pathlib import Path
import tarfile
import shutil
import urllib.request
import warnings
warnings.filterwarnings("ignore")
# Plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# scikit-learn libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
import datetime as dt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, cross_validate
# Directory for images (if needed)
IMAGES_PATH = Path() / "images"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
def load_wine_data(url, ordner_name, csv_name):
    ordner = Path(ordner_name)
    zip_file_path = ordner / Path(url).name
    if not zip_file_path.is_file():
        ordner.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, zip_file_path)
    destination_path = ordner / "wine_quality"
    if not destination_path.is_dir():
        destination_path.mkdir()
        shutil.unpack_archive(zip_file_path, destination_path)
    csv_path = destination_path / csv_name
    if csv_path.is_file():
        return pd.read_csv(csv_path, delimiter=";")
    else:
        return None
# Usage
df_w = load_wine_data(
url="https://archive.ics.uci.edu/static/public/186/wine+quality.zip",
ordner_name="datasets",
csv_name="winequality-white.csv"
)
# Alternative: load local csv files directly
# (The internet sources are not under our control and may be removed or changed)
#df_w = pd.read_csv("./Originaldatensatz/wine+quality/winequality-white.csv", delimiter=";")
print("Shape von df_w:", df_w.shape)
Shape von df_w: (4898, 12)
df_w.head(n = 10).style.background_gradient(cmap = 'Blues')
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.000000 | 0.270000 | 0.360000 | 20.700000 | 0.045000 | 45.000000 | 170.000000 | 1.001000 | 3.000000 | 0.450000 | 8.800000 | 6 |
| 1 | 6.300000 | 0.300000 | 0.340000 | 1.600000 | 0.049000 | 14.000000 | 132.000000 | 0.994000 | 3.300000 | 0.490000 | 9.500000 | 6 |
| 2 | 8.100000 | 0.280000 | 0.400000 | 6.900000 | 0.050000 | 30.000000 | 97.000000 | 0.995100 | 3.260000 | 0.440000 | 10.100000 | 6 |
| 3 | 7.200000 | 0.230000 | 0.320000 | 8.500000 | 0.058000 | 47.000000 | 186.000000 | 0.995600 | 3.190000 | 0.400000 | 9.900000 | 6 |
| 4 | 7.200000 | 0.230000 | 0.320000 | 8.500000 | 0.058000 | 47.000000 | 186.000000 | 0.995600 | 3.190000 | 0.400000 | 9.900000 | 6 |
| 5 | 8.100000 | 0.280000 | 0.400000 | 6.900000 | 0.050000 | 30.000000 | 97.000000 | 0.995100 | 3.260000 | 0.440000 | 10.100000 | 6 |
| 6 | 6.200000 | 0.320000 | 0.160000 | 7.000000 | 0.045000 | 30.000000 | 136.000000 | 0.994900 | 3.180000 | 0.470000 | 9.600000 | 6 |
| 7 | 7.000000 | 0.270000 | 0.360000 | 20.700000 | 0.045000 | 45.000000 | 170.000000 | 1.001000 | 3.000000 | 0.450000 | 8.800000 | 6 |
| 8 | 6.300000 | 0.300000 | 0.340000 | 1.600000 | 0.049000 | 14.000000 | 132.000000 | 0.994000 | 3.300000 | 0.490000 | 9.500000 | 6 |
| 9 | 8.100000 | 0.220000 | 0.430000 | 1.500000 | 0.044000 | 28.000000 | 129.000000 | 0.993800 | 3.220000 | 0.450000 | 11.000000 | 6 |
df_w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
Result:
df_w.describe()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 |
| mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
| std | 0.843868 | 0.100795 | 0.121020 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
| min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
| 25% | 6.300000 | 0.210000 | 0.270000 | 1.700000 | 0.036000 | 23.000000 | 108.000000 | 0.991723 | 3.090000 | 0.410000 | 9.500000 | 5.000000 |
| 50% | 6.800000 | 0.260000 | 0.320000 | 5.200000 | 0.043000 | 34.000000 | 134.000000 | 0.993740 | 3.180000 | 0.470000 | 10.400000 | 6.000000 |
| 75% | 7.300000 | 0.320000 | 0.390000 | 9.900000 | 0.050000 | 46.000000 | 167.000000 | 0.996100 | 3.280000 | 0.550000 | 11.400000 | 6.000000 |
| max | 14.200000 | 1.100000 | 1.660000 | 65.800000 | 0.346000 | 289.000000 | 440.000000 | 1.038980 | 3.820000 | 1.080000 | 14.200000 | 9.000000 |
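describe() already hints at strong right skew in several features (e.g. residual sugar: median 5.2 vs. max 65.8). Skew can be quantified directly with pandas; a small sketch on toy data:

```python
import pandas as pd

# A small right-skewed sample: most values low, one far out in the right tail
s = pd.Series([1.0, 1.2, 1.4, 1.6, 2.0, 20.0])
print(s.skew() > 1)  # True: strong positive (right) skew
```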
# Check column names
df_w.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
# Rename the columns to replace spaces with underscores
df_w.rename(lambda x: x.replace(' ','_'), axis=1, inplace=True)
# Check column names again
df_w.columns
Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
'pH', 'sulphates', 'alcohol', 'quality'],
dtype='object')
# Check for missing values
df_w.isnull().sum().sort_values(ascending=False)
fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
Result: there are no missing values in the dataset.
# List all duplicates except their first occurrence
df_w.loc[df_w.duplicated(),:]
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.900000 | 6 |
| 5 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.100000 | 6 |
| 7 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.800000 | 6 |
| 8 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.500000 | 6 |
| 20 | 6.2 | 0.66 | 0.48 | 1.2 | 0.029 | 29.0 | 75.0 | 0.98920 | 3.33 | 0.39 | 12.800000 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4828 | 6.4 | 0.23 | 0.35 | 10.3 | 0.042 | 54.0 | 140.0 | 0.99670 | 3.23 | 0.47 | 9.200000 | 5 |
| 4850 | 7.0 | 0.36 | 0.35 | 2.5 | 0.048 | 67.0 | 161.0 | 0.99146 | 3.05 | 0.56 | 11.100000 | 6 |
| 4851 | 6.4 | 0.33 | 0.44 | 8.9 | 0.055 | 52.0 | 164.0 | 0.99488 | 3.10 | 0.48 | 9.600000 | 5 |
| 4856 | 7.1 | 0.23 | 0.39 | 13.7 | 0.058 | 26.0 | 172.0 | 0.99755 | 2.90 | 0.46 | 9.000000 | 6 |
| 4880 | 6.6 | 0.34 | 0.40 | 8.1 | 0.046 | 68.0 | 170.0 | 0.99494 | 3.15 | 0.50 | 9.533333 | 6 |
937 rows × 12 columns
Result: 937 rows are duplicates of earlier rows and will be removed.
# Remove duplicates (keeping the first occurrence)
df_w = df_w.drop_duplicates()
df_w.head()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 6 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.9949 | 3.18 | 0.47 | 9.6 | 6 |
# Inspect the cleaned dataset:
df_w.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3961 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed_acidity         3961 non-null   float64
 1   volatile_acidity      3961 non-null   float64
 2   citric_acid           3961 non-null   float64
 3   residual_sugar        3961 non-null   float64
 4   chlorides             3961 non-null   float64
 5   free_sulfur_dioxide   3961 non-null   float64
 6   total_sulfur_dioxide  3961 non-null   float64
 7   density               3961 non-null   float64
 8   pH                    3961 non-null   float64
 9   sulphates             3961 non-null   float64
 10  alcohol               3961 non-null   float64
 11  quality               3961 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 402.3 KB
Result of the duplicate check & cleaning: after removing the 937 duplicate rows, 3961 of the original 4898 rows remain.
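drop_duplicates() keeps the first occurrence of each row by default, matching the duplicated() listing above; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})
# duplicated() marks every repeat of an earlier row ...
print(df.duplicated().tolist())   # [False, True, False]
# ... and drop_duplicates() removes exactly those rows
print(len(df.drop_duplicates()))  # 2
```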
sns.pairplot(df_w, diag_kind = "hist", hue = "quality", height = 3, aspect = 1, corner = True);
Pairplot result:
# Create a subplot grid with 4 rows and 3 columns
fig = make_subplots(rows=4, cols=3, subplot_titles=df_w.columns)
# Add a histogram to each subplot
for i, col in enumerate(df_w.columns):
    row = (i // 3) + 1
    col_num = (i % 3) + 1
    histogram = go.Histogram(x=df_w[col], name=col, nbinsx=80, marker=dict(line=dict(color='black', width=1)))
    fig.add_trace(histogram, row=row, col=col_num)
# Update layout
fig.update_layout(title="Histogramme der Spalten", title_x=0.5, title_y=0.95, title_font=dict(size=24, color='black')
                  , showlegend=False
                  , plot_bgcolor='rgba(0,0,0,0)'
                  , height=700
                  )
# Save figure
save_fig("Histogramme der Spalten")
# Show plot
fig.show()
<Figure size 640x480 with 0 Axes>
Results of the histogram inspection:
# residual_sugar: very many values between 1 and 1.4
df_w['residual_sugar'].value_counts()
residual_sugar
1.40 165
1.20 165
1.60 144
1.30 134
1.10 126
...
8.45 1
0.60 1
16.60 1
18.30 1
18.40 1
Name: count, Length: 310, dtype: int64
# Create individual vertical boxplots with separate y-axes
fig, axes = plt.subplots(1, 12, figsize=(10, 5), sharey=False)
for i, column_name in enumerate(df_w.columns):
    sns.boxplot(y=df_w[column_name], ax=axes[i], color='green', orient='v')
    axes[i].set_ylabel(None)
    axes[i].text(0.5, -0.3, column_name, transform=axes[i].transAxes,
                 rotation=90, ha='center', fontsize=9, color='darkblue')
plt.tight_layout()
# Save figure
save_fig("Boxplots aller Features")
# Show plot
plt.show()
Results of the outlier inspection
quality = df_w['quality'].value_counts()
quality
quality
6    1788
5    1175
7     689
4     153
8     131
3      20
9       5
Name: count, dtype: int64
# Determine and sort the quality levels and their counts
quality_counts = df_w['quality'].value_counts().sort_index().reset_index()
quality_counts
| quality | count | |
|---|---|---|
| 0 | 3 | 20 |
| 1 | 4 | 153 |
| 2 | 5 | 1175 |
| 3 | 6 | 1788 |
| 4 | 7 | 689 |
| 5 | 8 | 131 |
| 6 | 9 | 5 |
# Create bar plot
fig_country = px.bar(x=quality.index,
                     y=quality.values,
                     labels={'x':'Qualität','y':'Anzahl der Varianten'},
                     title='Qualität der Weisswein Varianten'
                     )
fig_country.update_layout(plot_bgcolor='rgba(0,0,0,0)', title=dict(x=0.48))
# Save figure
save_fig("Qualität der Weisswein Varianten")
# Show plot
fig_country.show()
<Figure size 640x480 with 0 Axes>
# Extract the numeric correlation coefficients
correlation_matrix = df_w.corr(numeric_only=True)
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, annot_kws={"size": 9})
plt.title("Korrelationsmatrix Heatmap für Weißwein")
plt.show()
White wine correlations:
Very high for: 'density' vs. 'residual_sugar' (positive) and 'density' vs. 'alcohol' (negative), examined in the scatterplots below.
High for: 'alcohol' vs. 'quality'.
# Create scatter plot
fig_scatter = px.scatter(data_frame=df_w,
                         x='density',
                         y='residual_sugar',
                         title="Weisswein: 'density' vs. 'residual_sugar'",
                         color="quality",
                         opacity=0.3
                         )
# Update layout
fig_scatter.update_layout(title=dict(x=0.5),
                          title_font=dict(size=20, color='black'),
                          height=600,
                          #plot_bgcolor='rgba(0,0,0,0)'
                          )
# Save figure
save_fig("Weisswein: 'density' vs. 'residual_sugar'")
# Show figure
fig_scatter.show()
<Figure size 640x480 with 0 Axes>
# Create scatter plot
fig_scatter = px.scatter(data_frame=df_w,
                         x='density',
                         y='alcohol',
                         title="Weisswein: 'density' vs. 'alcohol'",
                         color="quality",
                         opacity=0.3
                         )
# Update layout
fig_scatter.update_layout(title=dict(x=0.5),
                          title_font=dict(size=20, color='black'),
                          height=600,
                          #plot_bgcolor='rgba(0,0,0,0)'
                          )
# Save figure
save_fig("Weisswein: 'density' vs. 'alcohol'")
# Show figure
fig_scatter.show()
<Figure size 640x480 with 0 Axes>
Scatterplot result: 'density' rises with 'residual_sugar' and falls with 'alcohol'.
# Compute the 75th percentile (Q3) for all columns except "quality"
Q3_values = df_w.loc[:, df_w.columns != 'quality'].quantile(0.75)
# Compute the outlier threshold (3*Q3) for every column
outlier_thresholds = 3 * Q3_values
# Build a boolean DataFrame: True for outliers, False otherwise
outliers = (df_w.loc[:, df_w.columns != 'quality'] > outlier_thresholds)
# Filter the original DataFrame based on the outliers in each column
df_w_cleaned = df_w[~outliers.any(axis=1)]
# How many removed rows fall into each "quality" group?
# Build a DataFrame of the outliers
df_outliers = df_w[outliers.any(axis=1)]
# Group the outliers by "quality" and show the count per group
outliers_grouped = df_outliers.groupby('quality').size().reset_index(name='count')
print("\nOutliers grouped by 'quality':")
print(outliers_grouped)
# (optional) DataFrame of the remaining rows in df_w_cleaned
#df_remaining = df_w_cleaned[~outliers.any(axis=1)]
# Group the remaining rows by "quality" and show the count per group
remaining_grouped = df_w_cleaned.groupby('quality').size().reset_index(name='count')
print("\nRemaining rows grouped by 'quality':")
print(remaining_grouped)
Outliers grouped by 'quality':
   quality  count
0        3      3
1        4      4
2        5     29
3        6     28

Remaining rows grouped by 'quality':
   quality  count
0        3     17
1        4    149
2        5   1146
3        6   1760
4        7    689
5        8    131
6        9      5
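The 3×Q3 rule used above flags only unusually large values (unlike the common IQR rule, it has no lower bound). Illustrated on toy data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
q3 = df["x"].quantile(0.75)   # 4.0
outliers = df["x"] > 3 * q3   # threshold 12.0: only 100.0 exceeds it
print(df.loc[~outliers, "x"].tolist())  # [1.0, 2.0, 3.0, 4.0]
```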
Result of the outlier removal: 64 rows were flagged as outliers and removed; 3897 rows remain.
# Inspect the cleaned dataset
df_w_cleaned.describe()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 |
| mean | 6.839325 | 0.279106 | 0.331989 | 5.913459 | 0.043848 | 34.690018 | 136.814344 | 0.993763 | 3.196823 | 0.490549 | 10.603143 | 5.864255 |
| std | 0.869122 | 0.100943 | 0.117452 | 4.763448 | 0.014643 | 16.489185 | 42.650732 | 0.002810 | 0.151563 | 0.113813 | 1.216054 | 0.889303 |
| min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
| 25% | 6.300000 | 0.210000 | 0.270000 | 1.600000 | 0.035000 | 23.000000 | 106.000000 | 0.991600 | 3.100000 | 0.410000 | 9.500000 | 5.000000 |
| 50% | 6.800000 | 0.260000 | 0.320000 | 4.700000 | 0.042000 | 33.000000 | 132.000000 | 0.993440 | 3.190000 | 0.480000 | 10.500000 | 6.000000 |
| 75% | 7.300000 | 0.320000 | 0.380000 | 8.900000 | 0.050000 | 45.000000 | 166.000000 | 0.995710 | 3.290000 | 0.550000 | 11.400000 | 6.000000 |
| max | 14.200000 | 0.930000 | 1.000000 | 26.050000 | 0.150000 | 131.000000 | 366.500000 | 1.002950 | 3.820000 | 1.080000 | 14.200000 | 9.000000 |
df_w_cleaned.to_pickle("df_w_cleaned.pkl")
# Load the pickled dataset
# The name should be consistent, e.g. df_w_ml as here, but anyone may temporarily pick another name (except df_w) for the project so that there is no interference for now
df_w_ml = pd.read_pickle("df_w_cleaned.pkl")
# Define a function that maps quality into three categories
def quality_to_category(quality):
    if quality in [3, 4]:
        return 0
    elif quality in [5, 6]:
        return 1
    else:
        return 2
# Create a copy of df_w_ml
df_w_ml3 = df_w_ml.copy()
# Apply quality_to_category to the 'quality' column
df_w_ml3['quality'] = df_w_ml3['quality'].apply(quality_to_category)
## Verify the transformation
df_w_ml3.describe()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 | 3897.000000 |
| mean | 6.839325 | 0.279106 | 0.331989 | 5.913459 | 0.043848 | 34.690018 | 136.814344 | 0.993763 | 3.196823 | 0.490549 | 10.603143 | 1.169104 |
| std | 0.869122 | 0.100943 | 0.117452 | 4.763448 | 0.014643 | 16.489185 | 42.650732 | 0.002810 | 0.151563 | 0.113813 | 1.216054 | 0.475142 |
| min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 0.000000 |
| 25% | 6.300000 | 0.210000 | 0.270000 | 1.600000 | 0.035000 | 23.000000 | 106.000000 | 0.991600 | 3.100000 | 0.410000 | 9.500000 | 1.000000 |
| 50% | 6.800000 | 0.260000 | 0.320000 | 4.700000 | 0.042000 | 33.000000 | 132.000000 | 0.993440 | 3.190000 | 0.480000 | 10.500000 | 1.000000 |
| 75% | 7.300000 | 0.320000 | 0.380000 | 8.900000 | 0.050000 | 45.000000 | 166.000000 | 0.995710 | 3.290000 | 0.550000 | 11.400000 | 1.000000 |
| max | 14.200000 | 0.930000 | 1.000000 | 26.050000 | 0.150000 | 131.000000 | 366.500000 | 1.002950 | 3.820000 | 1.080000 | 14.200000 | 2.000000 |
df_w_ml3.to_pickle("df_w_ml3.pkl")
As mentioned above, there are some correlations between features in our dataset. It is therefore standard practice to perform a PCA to find the most important components.
Before that, the data must be scaled. We use StandardScaler, since standardization ensures that every feature contributes equally to the variance captured by PCA.
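To see why scaling matters: StandardScaler transforms every column to mean 0 and standard deviation 1, so no feature dominates the PCA variance merely because of its unit. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6))  # [0. 0.]
print(X_std.std(axis=0).round(6))   # [1. 1.]
```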
df_w_ml3 = pd.read_pickle("df_w_ml3.pkl")
# Separate features and target variable
X = df_w_ml3.drop('quality', axis=1)
y = df_w_ml3['quality']
# Apply standard scaling
sc = StandardScaler()
X_trans = sc.fit_transform(X)
X_trans[:3]
array([[ 0.18489414, -0.09021826, 0.23852183, 3.10456569, 0.07865777,
0.62533745, 0.77817924, 2.57566243, -1.29878878, -0.35632442,
-1.48297162],
[-0.62062007, 0.20701765, 0.06821777, -0.90564907, 0.3518568 ,
-1.25492392, -0.11289284, 0.08419072, 0.68083882, -0.00482556,
-0.90726555],
[ 1.45070219, 0.00886038, 0.57912995, 0.20713304, 0.42015656,
-0.28446644, -0.93361712, 0.4757077 , 0.41688848, -0.44419913,
-0.4138032 ]])
# Perform PCA
pca = PCA()
X_pca = pca.fit_transform(X_trans)
pca.explained_variance_ratio_
array([0.30064137, 0.14514628, 0.11064042, 0.09595564, 0.08891342,
0.07549592, 0.06463037, 0.05351846, 0.03761823, 0.02598582,
0.00145407])
# Cumulative sum of explained_variance_ratio_
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
cumulative_explained_variance
array([0.30064137, 0.44578764, 0.55642806, 0.65238371, 0.74129713,
0.81679306, 0.88142342, 0.93494189, 0.97256011, 0.99854593,
1. ])
# Create DataFrame
df_pca = pd.DataFrame({
'Principal Components': np.arange(1, len(cumulative_explained_variance) + 1),
'Cumulative Explained Variance': cumulative_explained_variance
})
# Create plot
fig = px.line(df_pca, x='Principal Components', y='Cumulative Explained Variance', markers=True)
# Update layout
fig.update_layout(
title='Cumulative Explained Variance Plot',
xaxis_title='Principal Components',
yaxis_title='Cumulative Explained Variance',
title_font=dict(size=20, color='black')
)
# Save figure
save_fig("Cumulative Explained Variance Plot")
# Show plot
fig.show()
<Figure size 640x480 with 0 Axes>
Result: the figure shows that nine principal components already reach just over 97% cumulative explained variance.
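Instead of reading the number of components off the plot, scikit-learn can also be given the target variance fraction directly via PCA(n_components=0.97). A sketch on synthetic data (the 97% target mirrors the finding above; the data here is random, not the wine dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 11))
X[:, 10] = X[:, 0]  # make one column fully redundant

pca = PCA(n_components=0.97)  # keep enough components for >= 97% variance
X_red = pca.fit_transform(X)
print(X_red.shape[1] <= 11)                         # True
print(pca.explained_variance_ratio_.sum() >= 0.97)  # True
```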
pca_new = PCA(n_components=9)
X_new = pca_new.fit_transform(X_trans)
X_trans[:3]
array([[ 0.18489414, -0.09021826, 0.23852183, 3.10456569, 0.07865777,
0.62533745, 0.77817924, 2.57566243, -1.29878878, -0.35632442,
-1.48297162],
[-0.62062007, 0.20701765, 0.06821777, -0.90564907, 0.3518568 ,
-1.25492392, -0.11289284, 0.08419072, 0.68083882, -0.00482556,
-0.90726555],
[ 1.45070219, 0.00886038, 0.57912995, 0.20713304, 0.42015656,
-0.28446644, -0.93361712, 0.4757077 , 0.41688848, -0.44419913,
-0.4138032 ]])
X_new[:3]
array([[ 3.94106075, 0.59600194, 0.96127829, -0.85436533, -0.46998408,
-1.59941421, -0.06479839, -1.01210124, 0.14371296],
[-0.43329731, -0.52337733, 0.33880695, 1.44274863, -0.31938373,
-0.08973335, 0.52670637, 0.32024244, -0.99983167],
[ 0.37721191, 1.15366837, 0.16577858, 0.75471802, -0.34894133,
-0.39462469, 0.59847843, 0.77820374, 0.53271563]])
X_pca = pd.DataFrame(X_new)
X_pca.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.941061 | 0.596002 | 0.961278 | -0.854365 | -0.469984 | -1.599414 | -0.064798 | -1.012101 | 0.143713 |
| 1 | -0.433297 | -0.523377 | 0.338807 | 1.442749 | -0.319384 | -0.089733 | 0.526706 | 0.320242 | -0.999832 |
| 2 | 0.377212 | 1.153668 | 0.165779 | 0.754718 | -0.348941 | -0.394625 | 0.598478 | 0.778204 | 0.532716 |
| 3 | 1.768247 | -0.058851 | 0.058040 | -0.167592 | -0.847375 | 0.673288 | -0.191817 | 0.452547 | 0.445519 |
| 4 | 0.256925 | -0.952377 | 1.318177 | 0.237570 | -0.364846 | -0.416277 | -0.777440 | -0.126311 | -0.423513 |
X_pca.to_pickle("X_pca.pkl")
Logistic regression on the scaled PCA features
With the scaled PCA features, the computation is faster than on the raw data.
# get data
X_pca = pd.read_pickle("X_pca.pkl")
# print(len(X_pca))
df_w_ml3 = pd.read_pickle("df_w_ml3.pkl")
y = df_w_ml3["quality"]
# print(len(y))
# X_pca as feature/input
# y as output
# train and test split
test_size_ratio = 0.3
X_logreg_train, X_logreg_test, y_logreg_train, y_logreg_test = train_test_split(X_pca, y, test_size=test_size_ratio, random_state=42)
# data describe, if needed
# X_logreg_train.describe()
# X_logreg_test.describe()
# y_logreg_train.describe()
# y_logreg_test.describe()
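Given the strong class imbalance (category 0 covers only about 4% of the rows), a stratified split would be a possible alternative to the plain split above: passing stratify=y keeps the class proportions identical in train and test. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 10% class 0, 90% class 1
y = np.array([0] * 10 + [1] * 90)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print((y_te == 0).sum(), (y_te == 1).sum())  # 3 27
```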
# logistic regression model
log_reg = LogisticRegression(max_iter=50_000)
ticktime = dt.datetime.now()
log_reg.fit(X_logreg_train, y_logreg_train)
log_reg_duration = dt.datetime.now() - ticktime
log_reg
LogisticRegression(max_iter=50000)
# scoring
log_reg_train_score = log_reg.score(X_logreg_train, y_logreg_train)
log_reg_test_score = log_reg.score(X_logreg_test, y_logreg_test)
print(f"logistic regression training score: {log_reg_train_score} ")
print(f"logistic regression test score: {log_reg_test_score} ")
print()
print(f"logistic regression duration: {log_reg_duration}")
logistic regression training score: 0.768976897689769
logistic regression test score: 0.7547008547008547

logistic regression duration: 0:00:00.016955
y_logreg_pred = log_reg.predict(X_logreg_test)
# confusion matrix
log_reg_conf_matrix = confusion_matrix(y_logreg_test, y_logreg_pred)
log_reg_conf_matrix
# Create the heatmap
# plt.figure(figsize=(10, 8))
# sns.heatmap(log_reg_conf_matrix, annot=True, cmap='coolwarm', linewidths=0.5, annot_kws={"size": 9})
# plt.title("Confusion Matrix Heatmap: Weisswein")
# plt.show()
array([[ 0, 52, 1],
[ 0, 802, 65],
[ 0, 169, 81]], dtype=int64)
# classification_report
target_names = ["bad wine", "typical wine", "good wine"]
print(classification_report(y_logreg_test, y_logreg_pred, target_names=target_names))
              precision    recall  f1-score   support

    bad wine       0.00      0.00      0.00        53
typical wine       0.78      0.93      0.85       867
   good wine       0.55      0.32      0.41       250

    accuracy                           0.75      1170
   macro avg       0.44      0.42      0.42      1170
weighted avg       0.70      0.75      0.72      1170
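The macro average in the report weights every class equally, regardless of support, which is why the never-predicted smallest class pulls it down to 0.44. A small sketch of macro-averaged precision:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 1]
# Per-class precision: class 0 -> 1/1, class 1 -> 2/4, class 2 -> 1/1
macro = precision_score(y_true, y_pred, average="macro")
print(round(macro, 4))  # 0.8333
```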
# cross validation score
log_reg_cross_val_train_score = cross_val_score(log_reg, X_logreg_train, y_logreg_train, cv=7, scoring="precision_macro")
print("logistic regression cross-validation training score: \n", log_reg_cross_val_train_score, type(log_reg_cross_val_train_score))
print()
print("mean: ", log_reg_cross_val_train_score.mean())
print("standard deviation: ", log_reg_cross_val_train_score.std())
print("min: ", log_reg_cross_val_train_score.min())
print("max: ", log_reg_cross_val_train_score.max())
logistic regression cross-validation training score:
 [0.7736703  0.46285714 0.45615848 0.46766169 0.8331978  0.5751794  0.44365393] <class 'numpy.ndarray'>

mean:  0.5731969625096792
standard deviation:  0.15193320541597463
min:  0.44365393061045233
max:  0.8331977952471311
# cross validation detailed
log_reg_cross_val_train_score_1 = cross_validate(log_reg, X_logreg_train, y_logreg_train, cv=7, scoring="precision_macro")
# print(log_reg_cross_val_train_score_1)
log_reg_cross_val_train_score_1
{'fit_time': array([0.00998545, 0.00997496, 0.00897312, 0.00997281, 0.00999737,
0.00997496, 0.00900722]),
'score_time': array([0.00296712, 0.0030036 , 0.0030179 , 0.00296831, 0.00199366,
0.00196934, 0.00199676]),
'test_score': array([0.7736703 , 0.46285714, 0.45615848, 0.46766169, 0.8331978 ,
0.5751794 , 0.44365393])}
# grid search with cross validation
parameters = [
{"solver": ["lbfgs"], "penalty": ["l2"], "multi_class": ["ovr", "multinomial"]},
{"solver": ["saga"], "penalty": ["l2", "l1"], "multi_class": ["ovr", "multinomial"]}
]
log_reg_grid = GridSearchCV(log_reg, parameters, cv=7, scoring="precision_macro")
log_reg_grid
GridSearchCV(cv=7, estimator=LogisticRegression(max_iter=50000),
             param_grid=[{'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2'], 'solver': ['lbfgs']},
                         {'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2', 'l1'], 'solver': ['saga']}],
             scoring='precision_macro')
# learning
fitting = log_reg_grid.fit(X_logreg_train, y_logreg_train)
fitting
GridSearchCV(cv=7, estimator=LogisticRegression(max_iter=50000),
             param_grid=[{'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2'], 'solver': ['lbfgs']},
                         {'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2', 'l1'], 'solver': ['saga']}],
             scoring='precision_macro')
# GridSearch: detailed results
log_reg_grid.cv_results_
{'mean_fit_time': array([0.01396009, 0.01131558, 0.02114476, 0.02226356, 0.01265097,
0.02985239]),
'std_fit_time': array([0.0024015 , 0.00190692, 0.00456639, 0.00139714, 0.00056488,
0.00256359]),
'mean_score_time': array([0.0035617 , 0.00249478, 0.00292325, 0.00258013, 0.00270765,
0.00260101]),
'std_score_time': array([0.0009011 , 0.00046164, 0.0006781 , 0.00049626, 0.00035034,
0.0004499 ]),
'param_multi_class': masked_array(data=['ovr', 'multinomial', 'ovr', 'ovr', 'multinomial',
'multinomial'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_penalty': masked_array(data=['l2', 'l2', 'l2', 'l1', 'l2', 'l1'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_solver': masked_array(data=['lbfgs', 'lbfgs', 'saga', 'saga', 'saga', 'saga'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'multi_class': 'ovr', 'penalty': 'l2', 'solver': 'lbfgs'},
{'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'lbfgs'},
{'multi_class': 'ovr', 'penalty': 'l2', 'solver': 'saga'},
{'multi_class': 'ovr', 'penalty': 'l1', 'solver': 'saga'},
{'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'saga'},
{'multi_class': 'multinomial', 'penalty': 'l1', 'solver': 'saga'}],
'split0_test_score': array([0.44403394, 0.7736703 , 0.44403394, 0.44403394, 0.7736703 ,
0.7736703 ]),
'split1_test_score': array([0.47378389, 0.46285714, 0.47378389, 0.47967231, 0.46285714,
0.46285714]),
'split2_test_score': array([0.4647343 , 0.45615848, 0.4647343 , 0.4647343 , 0.45615848,
0.45615848]),
'split3_test_score': array([0.48717949, 0.46766169, 0.48717949, 0.48717949, 0.46766169,
0.47156085]),
'split4_test_score': array([0.49914388, 0.8331978 , 0.49914388, 0.49914388, 0.8331978 ,
0.84127871]),
'split5_test_score': array([0.4751955 , 0.5751794 , 0.4751955 , 0.4751955 , 0.5751794 ,
0.57969594]),
'split6_test_score': array([0.45368464, 0.44365393, 0.45368464, 0.45368464, 0.44365393,
0.44365393]),
'mean_test_score': array([0.47110795, 0.57319696, 0.47110795, 0.47194915, 0.57319696,
0.57555362]),
'std_test_score': array([0.01750663, 0.15193321, 0.01750663, 0.01775471, 0.15193321,
0.15355079]),
'rank_test_score': array([5, 2, 5, 4, 2, 1])}
# best model
best_log_reg = log_reg_grid.best_estimator_
best_log_reg
LogisticRegression(max_iter=50000, multi_class='multinomial', penalty='l1',
                   solver='saga')
best_training_score = best_log_reg.score(X_logreg_train, y_logreg_train)
best_test_score = best_log_reg.score(X_logreg_test, y_logreg_test)
print("best training score: ", best_training_score)
print("best test score: ", best_test_score)
best training score:  0.7686101943527686
best test score:  0.7547008547008547
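Besides `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_` (the mean cross-validated score of the winning configuration), which makes results like the ones above easier to interpret without reading the full `cv_results_` dictionary. A minimal, self-contained sketch on synthetic data (the dataset and grid here are illustrative, not the wine data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# small synthetic 3-class problem standing in for the wine features
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

parameters = [
    {"solver": ["lbfgs"], "penalty": ["l2"]},
    {"solver": ["saga"], "penalty": ["l2", "l1"]},
]
grid = GridSearchCV(LogisticRegression(max_iter=50_000), parameters,
                    cv=7, scoring="precision_macro")
grid.fit(X, y)

print(grid.best_params_)  # winning hyper-parameter combination
print(grid.best_score_)   # its mean cross-validated precision_macro
```

`best_score_` is a cross-validation mean on the training folds, so it is not directly comparable to the held-out test score reported above.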
y_logreg_best_pred = best_log_reg.predict(X_logreg_test)
# confusion matrix
confusion_matrix(y_logreg_test, y_logreg_best_pred)
array([[ 0, 52, 1],
[ 0, 802, 65],
[ 0, 169, 81]], dtype=int64)
# classification_report
target_names = ["schlechte Wein", "typische Wein", "gute Wein"]
print(classification_report(y_logreg_test, y_logreg_best_pred, target_names=target_names))
                precision    recall  f1-score   support

schlechte Wein       0.00      0.00      0.00        53
 typische Wein       0.78      0.93      0.85       867
     gute Wein       0.55      0.32      0.41       250

      accuracy                           0.75      1170
     macro avg       0.44      0.42      0.42      1170
  weighted avg       0.70      0.75      0.72      1170
Logistic regression with scaled PCA
With scaled PCA features, training runs faster than on the raw data.
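The `X_pca.pkl` file loaded below is assumed to already contain scaled, PCA-projected features. A minimal sketch of how such a matrix could be produced with a `StandardScaler` + `PCA` pipeline; the input matrix and the 95 % variance threshold are assumptions, not taken from the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical raw feature matrix standing in for the 11 wine features
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(200, 11))

# scale first, then project; keep enough components for 95 % of the variance
pca_pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=0.95, random_state=42))
X_pca = pca_pipe.fit_transform(X_raw)

print(X_pca.shape)  # at most as many columns as the raw features
```

Scaling before PCA matters because PCA is variance-driven: without it, features measured on large scales (e.g. total sulfur dioxide) would dominate the components.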
# get data
X_pca = pd.read_pickle("X_pca.pkl")
# print(len(X_pca))
df_w_ml = pd.read_pickle("df_w_cleaned.pkl")
y = df_w_ml["quality"]
# print(len(y))
# X_pca as feature/input
# y as output
# train and test split
test_size_ratio = 0.3
X_logreg_train, X_logreg_test, y_logreg_train, y_logreg_test = train_test_split(X_pca, y, test_size=test_size_ratio, random_state=42)
# data describe, if needed
# X_logreg_train.describe()
# X_logreg_test.describe()
# y_logreg_train.describe()
# y_logreg_test.describe()
# logistic regression model
log_reg = LogisticRegression(max_iter=50_000)
ticktime = dt.datetime.now()
log_reg.fit(X_logreg_train, y_logreg_train)
log_reg_duration = dt.datetime.now() - ticktime
log_reg
LogisticRegression(max_iter=50000)
# scoring
log_reg_train_score = log_reg.score(X_logreg_train, y_logreg_train)
log_reg_test_score = log_reg.score(X_logreg_test, y_logreg_test)
print(f"logistic regression training score: {log_reg_train_score} ")
print(f"logistic regression test score: {log_reg_test_score} ")
print()
print(f"logistic regression duration: {log_reg_duration}")
logistic regression training score: 0.5489548954895489 
logistic regression test score: 0.5111111111111111 

logistic regression duration: 0:00:00.103729
y_logreg_pred = log_reg.predict(X_logreg_test)
# confusion matrix
log_reg_conf_matrix = confusion_matrix(y_logreg_test, y_logreg_pred)
log_reg_conf_matrix
# create the heatmap
# plt.figure(figsize=(10, 8))
# sns.heatmap(log_reg_conf_matrix, annot=True, cmap='coolwarm', linewidths=0.5, annot_kws={"size": 9})
# plt.title("Confusion Matrix Heatmap: Weisswein")
# plt.show()
array([[ 0, 0, 1, 2, 1, 0, 0],
[ 0, 4, 27, 18, 0, 0, 0],
[ 0, 1, 160, 168, 2, 1, 0],
[ 0, 0, 103, 383, 47, 1, 1],
[ 0, 0, 12, 137, 51, 0, 0],
[ 0, 0, 1, 30, 17, 0, 0],
[ 0, 0, 0, 0, 2, 0, 0]], dtype=int64)
# classification_report
print(classification_report(y_logreg_test, y_logreg_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.80      0.08      0.15        49
           5       0.53      0.48      0.50       332
           6       0.52      0.72      0.60       535
           7       0.42      0.26      0.32       200
           8       0.00      0.00      0.00        48
           9       0.00      0.00      0.00         2

    accuracy                           0.51      1170
   macro avg       0.32      0.22      0.22      1170
weighted avg       0.49      0.51      0.48      1170
# cross validation score
log_reg_cross_val_train_score = cross_val_score(log_reg, X_logreg_train, y_logreg_train, cv=7, scoring="precision_macro")
print("logistic regression cross-validation training score: \n", log_reg_cross_val_train_score, type(log_reg_cross_val_train_score))
print()
print("mean: ", log_reg_cross_val_train_score.mean())
print("standard deviation: ", log_reg_cross_val_train_score.std())
print("min: ", log_reg_cross_val_train_score.min())
print("max: ", log_reg_cross_val_train_score.max())
logistic regression cross-validation training score: 
 [0.42613432 0.35934024 0.223941   0.23195844 0.3632031  0.36945034 0.34255309] <class 'numpy.ndarray'>

mean:  0.33094007564707967
standard deviation:  0.06946066251421522
min:  0.22394100004108636
max:  0.42613431828408577
# cross validation detailed
log_reg_cross_val_train_score_1 = cross_validate(log_reg, X_logreg_train, y_logreg_train, cv=7, scoring="precision_macro")
# print(log_reg_cross_val_train_score_1)
log_reg_cross_val_train_score_1
{'fit_time': array([0.088274 , 0.09026575, 0.0947516 , 0.09525394, 0.08975983,
0.08237576, 0.08078313]),
'score_time': array([0.0039897 , 0.00498652, 0.00199461, 0.00299168, 0.00498724,
0.00495028, 0.00199485]),
'test_score': array([0.42613432, 0.35934024, 0.223941 , 0.23195844, 0.3632031 ,
0.36945034, 0.34255309])}
# grid search with cross validation
parameters = [
{"solver": ["lbfgs"], "penalty": ["l2"], "multi_class": ["ovr", "multinomial"]},
{"solver": ["saga"], "penalty": ["l2", "l1"], "multi_class": ["ovr", "multinomial"]}
]
log_reg_grid = GridSearchCV(log_reg, parameters, cv=7, scoring="precision_macro")
log_reg_grid
GridSearchCV(cv=7, estimator=LogisticRegression(max_iter=50000),
             param_grid=[{'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2'], 'solver': ['lbfgs']},
                         {'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2', 'l1'], 'solver': ['saga']}],
             scoring='precision_macro')
# learning
fitting = log_reg_grid.fit(X_logreg_train, y_logreg_train)
fitting
GridSearchCV(cv=7, estimator=LogisticRegression(max_iter=50000),
             param_grid=[{'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2'], 'solver': ['lbfgs']},
                         {'multi_class': ['ovr', 'multinomial'],
                          'penalty': ['l2', 'l1'], 'solver': ['saga']}],
             scoring='precision_macro')
# GridSearch: detailed results
log_reg_grid.cv_results_
{'mean_fit_time': array([0.03063107, 0.08651478, 0.13185099, 0.18525451, 0.26847925,
0.43959079]),
'std_fit_time': array([0.00541695, 0.0098965 , 0.01705025, 0.01636275, 0.03787587,
0.06428882]),
'mean_score_time': array([0.00321542, 0.00351344, 0.00248245, 0.00284532, 0.00243143,
0.00256692]),
'std_score_time': array([0.00085013, 0.00112054, 0.00045268, 0.00034815, 0.00048639,
0.00041803]),
'param_multi_class': masked_array(data=['ovr', 'multinomial', 'ovr', 'ovr', 'multinomial',
'multinomial'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_penalty': masked_array(data=['l2', 'l2', 'l2', 'l1', 'l2', 'l1'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_solver': masked_array(data=['lbfgs', 'lbfgs', 'saga', 'saga', 'saga', 'saga'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'multi_class': 'ovr', 'penalty': 'l2', 'solver': 'lbfgs'},
{'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'lbfgs'},
{'multi_class': 'ovr', 'penalty': 'l2', 'solver': 'saga'},
{'multi_class': 'ovr', 'penalty': 'l1', 'solver': 'saga'},
{'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'saga'},
{'multi_class': 'multinomial', 'penalty': 'l1', 'solver': 'saga'}],
'split0_test_score': array([0.26287543, 0.42613432, 0.26287543, 0.26167161, 0.42613432,
0.42490202]),
'split1_test_score': array([0.38131963, 0.35934024, 0.38131963, 0.38131963, 0.35934024,
0.35934024]),
'split2_test_score': array([0.23139848, 0.223941 , 0.23139848, 0.23139848, 0.223941 ,
0.22593833]),
'split3_test_score': array([0.24407744, 0.23195844, 0.24407744, 0.24407744, 0.23195844,
0.23199443]),
'split4_test_score': array([0.29674852, 0.3632031 , 0.29674852, 0.2957858 , 0.3632031 ,
0.44564918]),
'split5_test_score': array([0.3159459 , 0.36945034, 0.3159459 , 0.31642323, 0.36945034,
0.3610583 ]),
'split6_test_score': array([0.24916849, 0.34255309, 0.24916849, 0.25505051, 0.34255309,
0.34255309]),
'mean_test_score': array([0.28307627, 0.33094008, 0.28307627, 0.28367525, 0.33094008,
0.34163366]),
'std_test_score': array([0.04879756, 0.06946066, 0.04879756, 0.04834091, 0.06946066,
0.07914787]),
'rank_test_score': array([5, 2, 5, 4, 2, 1])}
# best model
best_log_reg = log_reg_grid.best_estimator_
best_log_reg
LogisticRegression(max_iter=50000, multi_class='multinomial', penalty='l1',
                   solver='saga')
best_training_score = best_log_reg.score(X_logreg_train, y_logreg_train)
best_test_score = best_log_reg.score(X_logreg_test, y_logreg_test)
print("best training score: ", best_training_score)
print("best test score: ", best_test_score)
best training score:  0.5493215988265493
best test score:  0.5111111111111111
y_logreg_best_pred = best_log_reg.predict(X_logreg_test)
# confusion matrix
confusion_matrix(y_logreg_test, y_logreg_best_pred)
array([[ 0, 0, 1, 2, 1, 0, 0],
[ 0, 4, 27, 18, 0, 0, 0],
[ 0, 1, 160, 168, 2, 1, 0],
[ 1, 0, 103, 383, 47, 1, 0],
[ 0, 0, 12, 137, 51, 0, 0],
[ 0, 0, 1, 30, 17, 0, 0],
[ 0, 0, 0, 0, 2, 0, 0]], dtype=int64)
# classification_report
print(classification_report(y_logreg_test, y_logreg_best_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.80      0.08      0.15        49
           5       0.53      0.48      0.50       332
           6       0.52      0.72      0.60       535
           7       0.42      0.26      0.32       200
           8       0.00      0.00      0.00        48
           9       0.00      0.00      0.00         2

    accuracy                           0.51      1170
   macro avg       0.32      0.22      0.22      1170
weighted avg       0.49      0.51      0.48      1170
# get data
df_w_ml3 = pd.read_pickle("df_w_ml3.pkl")
y = df_w_ml3["quality"]
X = df_w_ml3.drop(columns=["quality"])
# train and test split
test_size_ratio = 0.3
X_bayes_train, X_bayes_test, y_bayes_train, y_bayes_test = train_test_split(X, y, test_size=test_size_ratio, random_state=42)
# data describe, if needed
# X_bayes_train.describe()
# X_bayes_test.describe()
# y_bayes_train.describe()
# y_bayes_test.describe()
# Gaussian Naive Bayes
myBayes = GaussianNB()
ticktime = dt.datetime.now()
myBayes.fit(X_bayes_train, y_bayes_train)
bayes_duration = dt.datetime.now() - ticktime
myBayes
GaussianNB()
# scoring
bayes_test_score = myBayes.score(X_bayes_test, y_bayes_test)
bayes_training_score = myBayes.score(X_bayes_train, y_bayes_train)
print(f"naive Bayes training score: {bayes_training_score} ")
print(f"naive Bayes test score: {bayes_test_score} ")
print()
print(f"naive Bayes duration: {bayes_duration}")
naive Bayes training score: 0.7040704070407041 
naive Bayes test score: 0.6914529914529914 

naive Bayes duration: 0:00:00.002980
y_bayes_pred = myBayes.predict(X_bayes_test)
# confusion matrix
confusion_matrix(y_bayes_test, y_bayes_pred)
array([[ 13, 34, 6],
[ 28, 625, 214],
[ 3, 76, 171]], dtype=int64)
# classification_report
target_names = ["schlechte Wein", "typische Wein", "gute Wein"]
print(classification_report(y_bayes_test, y_bayes_pred, target_names=target_names))
                precision    recall  f1-score   support

schlechte Wein       0.30      0.25      0.27        53
 typische Wein       0.85      0.72      0.78       867
     gute Wein       0.44      0.68      0.53       250

      accuracy                           0.69      1170
     macro avg       0.53      0.55      0.53      1170
  weighted avg       0.74      0.69      0.70      1170
# cross validation score
bayes_cross_val_train_score = cross_val_score(myBayes, X_bayes_train, y_bayes_train, cv=7, scoring="precision_macro")
print("naive Bayes cross-validation training score: \n", bayes_cross_val_train_score, type(bayes_cross_val_train_score))
print()
print("mean: ", bayes_cross_val_train_score.mean())
print("standard deviation: ", bayes_cross_val_train_score.std())
print("min: ", bayes_cross_val_train_score.min())
print("max: ", bayes_cross_val_train_score.max())
naive Bayes cross-validation training score: 
 [0.53107641 0.52910053 0.4877182  0.56860837 0.57268398 0.54887328 0.54281901] <class 'numpy.ndarray'>

mean:  0.5401256824702269
standard deviation:  0.02647119185518833
min:  0.4877182018914303
max:  0.5726839826839827
# cross validation detailed
bayes_cross_val_train_score_1 = cross_validate(myBayes, X_bayes_train, y_bayes_train, cv=7, scoring="precision_macro")
# print(bayes_cross_val_train_score_1)
bayes_cross_val_train_score_1
{'fit_time': array([0.00196338, 0.00201368, 0.00302672, 0.00296617, 0.00199604,
0.00199556, 0.00200295]),
'score_time': array([0.00299001, 0.00297356, 0.00198579, 0.00200415, 0.00201845,
0.00202727, 0.00298333]),
'test_score': array([0.53107641, 0.52910053, 0.4877182 , 0.56860837, 0.57268398,
0.54887328, 0.54281901])}
# grid search with cross validation
parameters = {
'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5],
}
bayes_grid = GridSearchCV(myBayes, parameters, cv=7, scoring="precision_macro")
bayes_grid
GridSearchCV(cv=7, estimator=GaussianNB(),
             param_grid={'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]},
             scoring='precision_macro')
# learning
fitting = bayes_grid.fit(X_bayes_train, y_bayes_train)
fitting
GridSearchCV(cv=7, estimator=GaussianNB(),
             param_grid={'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]},
             scoring='precision_macro')
# GridSearch: detailed results
bayes_grid.cv_results_
{'mean_fit_time': array([0.00257707, 0.00213521, 0.00197642, 0.00216821, 0.00199533]),
'std_fit_time': array([7.19951260e-04, 3.50554027e-04, 5.22792921e-05, 3.44084055e-04,
2.56173603e-05]),
'mean_score_time': array([0.00312209, 0.00258643, 0.00241995, 0.00242649, 0.00242104]),
'std_score_time': array([0.00034104, 0.00051076, 0.00050172, 0.00048995, 0.00049447]),
'param_var_smoothing': masked_array(data=[1e-09, 1e-08, 1e-07, 1e-06, 1e-05],
mask=[False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'var_smoothing': 1e-09},
{'var_smoothing': 1e-08},
{'var_smoothing': 1e-07},
{'var_smoothing': 1e-06},
{'var_smoothing': 1e-05}],
'split0_test_score': array([0.53107641, 0.52968483, 0.5301918 , 0.55160965, 0.49831515]),
'split1_test_score': array([0.52910053, 0.54005505, 0.55288949, 0.56118914, 0.55324074]),
'split2_test_score': array([0.4877182 , 0.47625165, 0.47684077, 0.48954788, 0.58411312]),
'split3_test_score': array([0.56860837, 0.57128205, 0.58216503, 0.5890323 , 0.42769466]),
'split4_test_score': array([0.57268398, 0.58602757, 0.58855857, 0.58397866, 0.66383838]),
'split5_test_score': array([0.54887328, 0.55220386, 0.55059493, 0.55995751, 0.64869787]),
'split6_test_score': array([0.54281901, 0.54391744, 0.55079525, 0.59022697, 0.60818282]),
'mean_test_score': array([0.54012568, 0.54277463, 0.54743369, 0.56079173, 0.56915468]),
'std_test_score': array([0.02647119, 0.03246482, 0.03428707, 0.03243113, 0.07775557]),
'rank_test_score': array([5, 4, 3, 2, 1])}
# best model
best_bayes = bayes_grid.best_estimator_
best_bayes
GaussianNB(var_smoothing=1e-05)
best_training_score = best_bayes.score(X_bayes_train, y_bayes_train)
best_test_score = best_bayes.score(X_bayes_test, y_bayes_test)
print("best training score: ", best_training_score)
print("best test score: ", best_test_score)
best training score:  0.7389072240557389
best test score:  0.7282051282051282
y_bayes_best_pred = best_bayes.predict(X_bayes_test)
# confusion matrix
confusion_matrix(y_bayes_test, y_bayes_best_pred)
array([[ 3, 47, 3],
[ 10, 714, 143],
[ 0, 115, 135]], dtype=int64)
# classification_report
target_names = ["schlechte Wein", "typische Wein", "gute Wein"]
print(classification_report(y_bayes_test, y_bayes_best_pred, target_names=target_names))
                precision    recall  f1-score   support

schlechte Wein       0.23      0.06      0.09        53
 typische Wein       0.82      0.82      0.82       867
     gute Wein       0.48      0.54      0.51       250

      accuracy                           0.73      1170
     macro avg       0.51      0.47      0.47      1170
  weighted avg       0.72      0.73      0.72      1170
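The three labels "schlechte Wein" / "typische Wein" / "gute Wein" presuppose that `df_w_ml3` carries a quality column already binned into three classes. A sketch of such a binning with `pd.cut`; the cut points 4 and 6 and the sample scores are illustrative assumptions, not taken from the notebook itself:

```python
import pandas as pd

# hypothetical quality scores on the 3-9 scale of the wine dataset
quality = pd.Series([3, 4, 5, 6, 6, 7, 8, 9])

# bin into three classes; the cut points (<=4 bad, 5-6 typical, >=7 good)
# are an assumed choice for illustration
quality3 = pd.cut(quality, bins=[2, 4, 6, 9],
                  labels=["schlecht", "typisch", "gut"])
print(quality3.value_counts().to_dict())
```

Whatever cut points are chosen, they determine the class imbalance (here 53 / 867 / 250 in the test set), which in turn drives the gap between macro and weighted averages in the reports above.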
# get data
df_w_ml = pd.read_pickle("df_w_cleaned.pkl")
y = df_w_ml["quality"]
X = df_w_ml.drop(columns=["quality"])
# train and test split
test_size_ratio = 0.3
X_bayes_train, X_bayes_test, y_bayes_train, y_bayes_test = train_test_split(X, y, test_size=test_size_ratio, random_state=42)
# data describe, if needed
# X_bayes_train.describe()
# X_bayes_test.describe()
# y_bayes_train.describe()
# y_bayes_test.describe()
# Gaussian Naive Bayes
myBayes = GaussianNB()
ticktime = dt.datetime.now()
myBayes.fit(X_bayes_train, y_bayes_train)
bayes_duration = dt.datetime.now() - ticktime
myBayes
GaussianNB()
# scoring
bayes_test_score = myBayes.score(X_bayes_test, y_bayes_test)
bayes_training_score = myBayes.score(X_bayes_train, y_bayes_train)
print(f"naive Bayes training score: {bayes_training_score} ")
print(f"naive Bayes test score: {bayes_test_score} ")
print()
print(f"naive Bayes duration: {bayes_duration}")
naive Bayes training score: 0.48588192152548587 
naive Bayes test score: 0.44871794871794873 

naive Bayes duration: 0:00:00.002960
y_bayes_pred = myBayes.predict(X_bayes_test)
# confusion matrix
confusion_matrix(y_bayes_test, y_bayes_pred)
array([[ 1, 0, 1, 1, 1, 0, 0],
[ 1, 10, 20, 15, 3, 0, 0],
[ 1, 14, 175, 115, 27, 0, 0],
[ 2, 11, 141, 215, 165, 1, 0],
[ 0, 3, 16, 56, 122, 3, 0],
[ 0, 1, 2, 14, 29, 2, 0],
[ 0, 0, 0, 0, 2, 0, 0]], dtype=int64)
# classification_report
print(classification_report(y_bayes_test, y_bayes_pred))
              precision    recall  f1-score   support

           3       0.20      0.25      0.22         4
           4       0.26      0.20      0.23        49
           5       0.49      0.53      0.51       332
           6       0.52      0.40      0.45       535
           7       0.35      0.61      0.44       200
           8       0.33      0.04      0.07        48
           9       0.00      0.00      0.00         2

    accuracy                           0.45      1170
   macro avg       0.31      0.29      0.28      1170
weighted avg       0.46      0.45      0.44      1170
# cross validation score
bayes_cross_val_train_score = cross_val_score(myBayes, X_bayes_train, y_bayes_train, cv=7, scoring="precision_macro")
print("naive Bayes cross-validation training score: \n", bayes_cross_val_train_score, type(bayes_cross_val_train_score))
print()
print("mean: ", bayes_cross_val_train_score.mean())
print("standard deviation: ", bayes_cross_val_train_score.std())
print("min: ", bayes_cross_val_train_score.min())
print("max: ", bayes_cross_val_train_score.max())
naive Bayes cross-validation training score: 
 [0.29334168 0.22281052 0.22146567 0.45392706 0.29020626 0.36175642 0.31836442] <class 'numpy.ndarray'>

mean:  0.3088388618572753
standard deviation:  0.0751796668909537
min:  0.22146566509394194
max:  0.45392705702727937
# cross validation detailed
bayes_cross_val_train_score_1 = cross_validate(myBayes, X_bayes_train, y_bayes_train, cv=7, scoring="precision_macro")
# print(bayes_cross_val_train_score_1)
bayes_cross_val_train_score_1
{'fit_time': array([0.003016 , 0.00299311, 0.00299168, 0.00296474, 0.00339818,
0.00296831, 0.00199628]),
'score_time': array([0.00296736, 0.00299144, 0.00302601, 0.00298524, 0.00261021,
0.00301552, 0.0029912 ]),
'test_score': array([0.29334168, 0.22281052, 0.22146567, 0.45392706, 0.29020626,
0.36175642, 0.31836442])}
# grid search with cross validation
parameters = {
'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5],
}
bayes_grid = GridSearchCV(myBayes, parameters, cv=7, scoring="precision_macro")
bayes_grid
GridSearchCV(cv=7, estimator=GaussianNB(),
             param_grid={'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]},
             scoring='precision_macro')
# learning
fitting = bayes_grid.fit(X_bayes_train, y_bayes_train)
fitting
GridSearchCV(cv=7, estimator=GaussianNB(),
             param_grid={'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]},
             scoring='precision_macro')
# GridSearch: detailed results
bayes_grid.cv_results_
{'mean_fit_time': array([0.00257094, 0.0022801 , 0.00200091, 0.00233497, 0.00230922]),
'std_fit_time': array([4.90061201e-04, 4.49318879e-04, 3.33440020e-05, 4.26588946e-04,
4.37670016e-04]),
'mean_score_time': array([0.00269546, 0.00256535, 0.00284341, 0.00306225, 0.00241736]),
'std_score_time': array([0.00043924, 0.00050436, 0.0003607 , 0.00019221, 0.00048818]),
'param_var_smoothing': masked_array(data=[1e-09, 1e-08, 1e-07, 1e-06, 1e-05],
mask=[False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'var_smoothing': 1e-09},
{'var_smoothing': 1e-08},
{'var_smoothing': 1e-07},
{'var_smoothing': 1e-06},
{'var_smoothing': 1e-05}],
'split0_test_score': array([0.29334168, 0.29350718, 0.29811809, 0.31541091, 0.3790066 ]),
'split1_test_score': array([0.22281052, 0.22854639, 0.23468513, 0.23372491, 0.24903816]),
'split2_test_score': array([0.22146567, 0.22349914, 0.22131646, 0.22294343, 0.39692997]),
'split3_test_score': array([0.45392706, 0.4510854 , 0.44903309, 0.40589827, 0.31978109]),
'split4_test_score': array([0.29020626, 0.2873879 , 0.3003041 , 0.3049138 , 0.29748051]),
'split5_test_score': array([0.36175642, 0.33916621, 0.3172129 , 0.31887551, 0.33269474]),
'split6_test_score': array([0.31836442, 0.31831518, 0.3215375 , 0.3212973 , 0.334456 ]),
'mean_test_score': array([0.30883886, 0.30592963, 0.30602961, 0.30329488, 0.32991244]),
'std_test_score': array([0.07517967, 0.07132354, 0.06884484, 0.05674511, 0.04570885]),
'rank_test_score': array([2, 4, 3, 5, 1])}
# best model
best_bayes = bayes_grid.best_estimator_
best_bayes
GaussianNB(var_smoothing=1e-05)
best_training_score = best_bayes.score(X_bayes_train, y_bayes_train)
best_test_score = best_bayes.score(X_bayes_test, y_bayes_test)
print("best training score: ", best_training_score)
print("best test score: ", best_test_score)
best training score:  0.48624862486248627
best test score:  0.4700854700854701
y_bayes_best_pred = best_bayes.predict(X_bayes_test)
# confusion matrix
confusion_matrix(y_bayes_test, y_bayes_best_pred)
array([[ 1, 0, 1, 1, 1, 0, 0],
[ 1, 6, 19, 21, 2, 0, 0],
[ 2, 9, 161, 147, 13, 0, 0],
[ 1, 5, 140, 279, 110, 0, 0],
[ 0, 0, 13, 84, 103, 0, 0],
[ 0, 0, 1, 24, 23, 0, 0],
[ 0, 0, 0, 0, 2, 0, 0]], dtype=int64)
# classification_report
print(classification_report(y_bayes_test, y_bayes_best_pred))
              precision    recall  f1-score   support

           3       0.20      0.25      0.22         4
           4       0.30      0.12      0.17        49
           5       0.48      0.48      0.48       332
           6       0.50      0.52      0.51       535
           7       0.41      0.52      0.45       200
           8       0.00      0.00      0.00        48
           9       0.00      0.00      0.00         2

    accuracy                           0.47      1170
   macro avg       0.27      0.27      0.26      1170
weighted avg       0.45      0.47      0.46      1170